MLS: A Large-Scale Multilingual Dataset for Speech Research
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large
multilingual corpus suitable for speech research. The dataset is derived from
read audiobooks from LibriVox and consists of 8 languages, including about
44.5K hours of English and a total of about 6K hours for the other languages.
Additionally, we provide Language Models (LM) and baseline Automatic Speech
Recognition (ASR) models for all the languages in our dataset. We believe
such a large transcribed dataset will open new avenues in ASR and
Text-To-Speech (TTS) research. The dataset will be made freely available to
anyone at http://www.openslr.org.
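
As a quick illustration of working with the corpus, here is a minimal,
hedged sketch that streams a few MLS samples through the Hugging Face
datasets library. The hub ID "facebook/multilingual_librispeech" is an
assumed mirror (the canonical download is the OpenSLR link above), and
field names can differ between mirrors, so the sketch prints the keys
instead of hard-coding them.

    # Hypothetical sketch: streaming MLS examples with Hugging Face `datasets`.
    # The hub ID below is an assumed mirror; the canonical source is OpenSLR.
    from datasets import load_dataset

    # Streaming avoids downloading the full corpus up front.
    mls_de = load_dataset(
        "facebook/multilingual_librispeech",
        "german",          # one of the 8 MLS languages
        split="test",
        streaming=True,
    )

    for sample in mls_de.take(3):
        # Field names (e.g. "transcript" vs. "text") vary between mirrors,
        # so inspect the keys rather than assuming them.
        print(sorted(sample.keys()))
        print(sample["audio"]["sampling_rate"])
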
wav2letter++: The Fastest Open-source Speech Recognition System
This paper introduces wav2letter++, the fastest open-source deep learning
speech recognition framework. wav2letter++ is written entirely in C++, and uses
the ArrayFire tensor library for maximum efficiency. Here we explain the
architecture and design of the wav2letter++ system and compare it to other
major open-source speech recognition systems. In some cases, wav2letter++ is
more than 2x faster than other optimized frameworks for training end-to-end
neural networks for speech recognition. We also show that wav2letter++'s
training times scale linearly to 64 GPUs, the highest we tested, for models
with 100 million parameters. High-performance frameworks enable fast iteration,
which is often a crucial factor in successful research and model tuning on new
datasets and tasks.
Scaling Speech Technology to 1,000+ Languages
Expanding the language coverage of speech technology has the potential to
improve access to information for many more people. However, current speech
technology is restricted to about one hundred languages, which is a small
fraction of the over 7,000 languages spoken around the world. The Massively
Multilingual Speech (MMS) project increases the number of supported languages
by 10-40x, depending on the task. The main ingredients are a new dataset
based on readings of publicly available religious texts and the effective use
of self-supervised learning. We built pre-trained wav2vec 2.0 models covering
1,406 languages, a single multilingual automatic speech recognition model for
1,107 languages, speech synthesis models for the same number of languages, as
well as a language identification model for 4,017 languages. Experiments show
that our multilingual speech recognition model more than halves the word error
rate of Whisper on 54 languages of the FLEURS benchmark while being trained on
a small fraction of the labeled data.
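
To make the released artifacts concrete, the hedged sketch below loads the
public MMS checkpoint "facebook/mms-1b-all" through Hugging Face
Transformers and runs a greedy CTC decode; the adapter-switching calls
reflect the Transformers MMS integration, not code from the paper itself,
and the input waveform is a placeholder.

    # Hedged sketch: MMS ASR inference via Hugging Face Transformers.
    # Requires: pip install torch transformers
    import torch
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    model_id = "facebook/mms-1b-all"  # released 1,107-language ASR checkpoint
    processor = AutoProcessor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)

    # Swap in the language-specific adapter and vocabulary, e.g. French.
    processor.tokenizer.set_target_lang("fra")
    model.load_adapter("fra")

    # Placeholder input: one second of 16 kHz silence stands in for speech.
    waveform = torch.zeros(16000)
    inputs = processor(waveform.numpy(), sampling_rate=16000,
                       return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits
    ids = torch.argmax(logits, dim=-1)[0]
    print(processor.decode(ids))
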
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch
TorchAudio is an open-source audio and speech processing library built for
PyTorch. It aims to accelerate the research and development of audio and speech
technologies by providing well-designed, easy-to-use, and performant PyTorch
components. Its contributors routinely engage with users to understand their
needs and fulfill them by developing impactful features. Here, we survey
TorchAudio's development principles and contents and highlight key features we
include in its latest version (2.1): self-supervised learning pre-trained
pipelines and training recipes, high-performance CTC decoders, speech
recognition models and training recipes, advanced media I/O capabilities, and
tools for performing forced alignment, multi-channel speech enhancement, and
reference-less speech assessment. For a selection of these features, through
empirical studies, we demonstrate their efficacy and show that they achieve
competitive or state-of-the-art performance.
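
As a concrete taste of the bundled pipeline API mentioned above, the short
sketch below runs a pre-trained ASR bundle with a simple greedy CTC decode;
the audio path is a placeholder, and the high-performance beam-search
decoder in torchaudio.models.decoder would be the natural replacement for
the greedy step.

    # Sketch: ASR with a TorchAudio pre-trained pipeline and greedy CTC decode.
    # Requires: pip install torch torchaudio
    import torch
    import torchaudio

    bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
    model = bundle.get_model()

    # "speech.wav" is a placeholder for any mono recording.
    waveform, sample_rate = torchaudio.load("speech.wav")
    if sample_rate != bundle.sample_rate:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, bundle.sample_rate
        )

    with torch.inference_mode():
        emissions, _ = model(waveform)  # frame-level label log-probs

    # Greedy CTC decoding: best label per frame, collapse repeats, drop the
    # blank token (index 0 in this bundle's label set).
    indices = torch.unique_consecutive(torch.argmax(emissions[0], dim=-1))
    labels = bundle.get_labels()
    transcript = "".join(labels[i] for i in indices if i != 0)
    print(transcript.replace("|", " "))  # "|" is the word boundary label
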
Performance and Efficiency Evaluation of ASR Inference on the Edge
Automatic speech recognition, the process of converting speech signals to
text, has improved a great deal in the past decade thanks to deep-learning-based
systems. With the latest transformer-based models, the recognition accuracy,
measured as word error rate (WER), is even below the human annotator error
(4%). However, most of these advanced models run on big servers with large
amounts of memory and CPU/GPU resources, and have a huge carbon footprint.
This server-based architecture of ASR is not viable in the long run given the
inherent lack of privacy for user data and the reliability and latency issues
of the network connection. On the other hand, on-device ASR (that is,
speech-to-text conversion on the edge device itself) fixes deep-rooted privacy
issues while at the same time being more reliable and performant by avoiding
network connectivity to the back-end server. On-device ASR can also lead to a
more sustainable solution by considering the energy vs. accuracy trade-off and
choosing the right model for the specific use cases/applications of the
product. Hence, in this paper we evaluate the energy-accuracy trade-off of ASR
with a typical transformer-based speech recognition model on an edge device.
We ran evaluations on a Raspberry Pi with an off-the-shelf USB meter for
measuring energy consumption. We conclude that, in the case of CPU-based ASR
inference, the energy consumption grows exponentially as the word error rate
improves linearly. Additionally, based on our experiments we deduce that, with
PyTorch mobile optimization and quantization, a typical transformer-based ASR
model on the edge performs reasonably well in terms of accuracy and latency,
and comes close to the accuracy of server-based inference.
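
The paper's exact model and scripts are not shown here, but the following
sketch illustrates what the PyTorch mobile optimization and quantization
step typically looks like, applied to a stand-in transformer encoder; all
names, shapes, and hyperparameters are illustrative assumptions.

    # Sketch: dynamic quantization + mobile optimization of a stand-in model.
    # Requires: pip install torch
    import torch
    from torch.utils.mobile_optimizer import optimize_for_mobile

    # Stand-in for a transformer-based acoustic model (not the paper's own).
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=256, nhead=4,
                                         batch_first=True),
        num_layers=4,
    ).eval()

    # Dynamic quantization: Linear weights stored as int8, activations
    # quantized on the fly -- smaller and faster for CPU-only edge inference.
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    # Script and optimize for the PyTorch mobile runtime, then save in the
    # lite-interpreter format that an on-device app would load.
    example = torch.randn(1, 100, 256)  # (batch, frames, feature dim)
    scripted = torch.jit.trace(quantized, example)
    mobile = optimize_for_mobile(scripted)
    mobile._save_for_lite_interpreter("asr_encoder_quantized.ptl")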